
Transformers Efficiently Perform In-Context Logistic Regression via Normalized Gradient Descent

arXiv.org Machine Learning

One widely recognized interpretation of the empirical success of transformers is their ability to perform in-context learning (ICL): pretrained transformers can perform previously unseen tasks based on demonstrations and examples in the prompt, without any additional task-specific fine-tuning (Brown et al., 2020). A line of recent work interprets this capability from an algorithmic perspective, viewing transformers as models that can implicitly execute certain learning algorithms on the context examples. Specifically, Garg et al. (2022) propose a theoretical framework for ICL in terms of learning a hypothesis class, and empirically show that transformers can in-context learn the linear function class. Motivated by this empirical finding, several recent works attempt to theoretically study how transformers perform in-context learning on linear regression tasks. Akyürek et al. (2022) and Von Oswald et al. (2023) construct multi-layer transformers with linear attention that can execute gradient descent on an "in-context loss" defined on the context data, thereby enabling in-context learning of linear regression.
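As a rough illustration of what "gradient descent on an in-context loss" means, the sketch below runs plain gradient descent on the least-squares loss over a set of context examples and then predicts on a query input. It shows only the algorithm, not a transformer that implements it; the function names, step size, step count, and synthetic data are all assumptions chosen for the example and do not come from the cited works.

```python
import random


def predict(w, x):
    """Linear prediction w . x."""
    return sum(wj * xj for wj, xj in zip(w, x))


def in_context_gd(xs, ys, steps=500, lr=0.2):
    """Gradient descent on L(w) = 1/(2n) * sum_i (w . x_i - y_i)^2.

    xs, ys play the role of the context examples in a prompt.
    steps and lr are illustrative values, not taken from any paper.
    """
    n, d = len(xs), len(xs[0])
    w = [0.0] * d  # start from the zero predictor
    for _ in range(steps):
        grad = [0.0] * d
        for x, y in zip(xs, ys):
            residual = predict(w, x) - y
            for j in range(d):
                grad[j] += residual * x[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


# Hypothetical task: noiseless linear labels from a hidden weight vector.
rng = random.Random(0)
w_true = [1.5, -2.0, 0.5]
xs = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(40)]
ys = [predict(w_true, x) for x in xs]

w_hat = in_context_gd(xs, ys)
x_query = [0.3, -0.1, 0.8]
print(predict(w_hat, x_query), predict(w_true, x_query))
```

With enough context examples and noiseless labels, the weights learned from the context alone recover the hidden linear function, which is the behavior the constructions above aim to reproduce inside the network's forward pass.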


8d2a5f7d4afa5d0530789d3066945330-Supplemental.pdf

Neural Information Processing Systems

A.4 Results on CIFAR-10 and CIFAR-100 In this section, we report results on CIFAR with different sizes of ResNets: ResNet-20 (RN20), ResNet-32 (RN32), ResNet-44 (RN44), ResNet-56 (RN56), ResNet-110 (RN110). We report results on CIFAR-10 in Table 15, and results on CIFAR-100 in Table 16.


On Minibatch Noise: Discrete-Time SGD, Overparametrization, and Bayes

arXiv.org Machine Learning

The noise in stochastic gradient descent (SGD), caused by minibatch sampling, remains poorly understood despite its enormous practical importance in offering good training efficiency and generalization ability. In this work, we study the minibatch noise in SGD. Motivated by the observation that minibatch sampling does not always cause a fluctuation, we set out to find the conditions that cause minibatch noise to emerge. We first derive analytically solvable results for linear regression under various settings and compare them to the approximations commonly used to understand SGD noise. We show that some degree of mismatch between model and data complexity is needed for SGD to "cause" noise, and that such mismatch may be due to static noise in the labels or in the input, the use of regularization, or underparametrization. Our results motivate a more accurate general formulation to describe minibatch noise.
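The observation that minibatch sampling does not always cause a fluctuation can be checked numerically in a toy setting. The sketch below is an illustration under assumed settings, not the paper's derivation. It fits linear regression by full-batch gradient descent, then measures the mean squared norm of minibatch gradients at the fitted weights, once with noiseless labels and once with label noise. All names, sizes, and hyperparameters are hypothetical.

```python
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def fit_full_batch(xs, ys, steps=2000, lr=0.1):
    """Full-batch gradient descent on the mean squared error (halved)."""
    n, d = len(xs), len(xs[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for x, y in zip(xs, ys):
            r = dot(w, x) - y
            for j in range(d):
                grad[j] += r * x[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def minibatch_grad_spread(xs, ys, w, rng, batch_size=5, trials=500):
    """Mean squared norm of minibatch gradients evaluated at w.

    At a full-batch minimizer the average gradient is (numerically) zero,
    so this quantity measures the fluctuation due to minibatch sampling.
    """
    n, d = len(xs), len(xs[0])
    total = 0.0
    for _ in range(trials):
        batch = rng.sample(range(n), batch_size)
        grad = [0.0] * d
        for i in batch:
            r = dot(w, xs[i]) - ys[i]
            for j in range(d):
                grad[j] += r * xs[i][j] / batch_size
        total += dot(grad, grad)
    return total / trials


rng = random.Random(0)
w_true = [1.0, -1.0]  # hypothetical ground-truth weights
xs = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
ys_clean = [dot(w_true, x) for x in xs]               # model matches data exactly
ys_noisy = [y + rng.gauss(0, 0.5) for y in ys_clean]  # static label noise

spread_clean = minibatch_grad_spread(xs, ys_clean, fit_full_batch(xs, ys_clean), rng)
spread_noisy = minibatch_grad_spread(xs, ys_noisy, fit_full_batch(xs, ys_noisy), rng)
print(spread_clean, spread_noisy)
```

When the model can fit every label exactly, each per-sample gradient vanishes at the minimizer, so any minibatch gives the same (zero) gradient. Once label noise leaves nonzero residuals, different minibatches disagree and the fluctuation appears.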